To do’s
Questions
table(alltools$method)
##
## diartk ldc openSat_Sum openSat_noSum
## 976 1081 978 978
summary(alltools$DER)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 65.93 90.93 106.38 110.60 3298.96
alltools$DER[alltools$DER>100]<-NA
summary(alltools$B3F1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2252 0.5482 0.6243 0.6192 0.6894 1.0000
summary(alltools$MI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.008839 0.048186 0.122271 0.164402 1.551829
LDC analyzed 1081 segments, whereas both openSat variants managed 978. The slightly lower number for diartk (976) probably reflects the fact that only segments with some speech get analyzed.
As usual, DER returns some implausible values (the maximum is 3298.96). Treating DER as a rate that should go from 0 to 100, we set values above 100 to NA, which affects 34% of the values.
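A minimal sketch of this capping step on made-up values (the vector `der` is illustrative and not taken from `alltools`), showing how the share of discarded values can be computed before they are overwritten:

```r
# Toy DER values in percent; illustrative only, not the real data
der <- c(0, 65.9, 90.9, 110.6, 3298.96, 42)

# Share of values above 100, computed before overwriting them
prop_over <- mean(der > 100, na.rm = TRUE)

# Replace out-of-range values with NA, as done for alltools$DER
der[which(der > 100)] <- NA

prop_over       # proportion flagged
summary(der)    # the NA count is reported alongside the quantiles
```

Using `which()` keeps the assignment safe if the column already contains missing values.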
There is nothing to be said of B3 F1. It goes from 0 to 1, as it should.
I don’t know enough about MI to say much about it, except that there seem to be some outlier values.
3 options: opensat no sum, opensat sum, ldc
ADD LINE NAMES HAVE CLIP ID, but odd looking…
The following regressions (one each for B3F1, DER and MI) do not take repeated measures into account; doing so could make the results starker or remove their significance. It is unlikely to change the fact that there is a trend towards lower performance for openSat than for ldc.
##
## Call:
## lm(formula = B3F1 ~ method, data = sad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17397 -0.06646 -0.01473 0.04707 0.35028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.680258 0.002881 236.096 < 2e-16 ***
## methodopenSat_Sum -0.030538 0.004181 -7.305 3.54e-13 ***
## methodopenSat_noSum -0.027270 0.004181 -6.523 8.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09473 on 3034 degrees of freedom
## Multiple R-squared: 0.0211, Adjusted R-squared: 0.02045
## F-statistic: 32.7 on 2 and 3034 DF, p-value: 8.926e-15
##
## Call:
## lm(formula = DER ~ method, data = sad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.259 -12.925 1.285 13.680 56.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.9047 0.6639 66.13 <2e-16 ***
## methodopenSat_Sum 34.9898 1.0642 32.88 <2e-16 ***
## methodopenSat_noSum 35.3542 1.0402 33.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.12 on 2131 degrees of freedom
## (903 observations deleted due to missingness)
## Multiple R-squared: 0.4288, Adjusted R-squared: 0.4283
## F-statistic: 799.9 on 2 and 2131 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = MI ~ method, data = sad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10268 -0.03498 -0.02294 0.01983 0.73191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.102682 0.002353 43.64 <2e-16 ***
## methodopenSat_Sum -0.063762 0.003414 -18.68 <2e-16 ***
## methodopenSat_noSum -0.067699 0.003414 -19.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07736 on 3034 degrees of freedom
## Multiple R-squared: 0.1424, Adjusted R-squared: 0.1419
## F-statistic: 252 on 2 and 3034 DF, p-value: < 2.2e-16
Next, a direct comparison between the two openSat variants. The differences are not significant and are numerically very small. Taking repeated measures into account would be important here.
##
## Welch Two Sample t-test
##
## data: B3F1 by method
## t = -0.76969, df = 1948.9, p-value = 0.4416
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.011594640 0.005058772
## sample estimates:
## mean in group openSat_Sum mean in group openSat_noSum
## 0.6497200 0.6529879
##
## Welch Two Sample t-test
##
## data: DER by method
## t = -0.38522, df = 1207.2, p-value = 0.7001
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.220224 1.491444
## sample estimates:
## mean in group openSat_Sum mean in group openSat_noSum
## 78.89456 79.25895
##
## Welch Two Sample t-test
##
## data: MI by method
## t = 1.3853, df = 1952, p-value = 0.1661
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.001636617 0.009511004
## sample estimates:
## mean in group openSat_Sum mean in group openSat_noSum
## 0.03891989 0.03498270
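Because both openSat variants are scored on the same clips, one way to account for repeated measures is to pair the scores by clip before testing. The sketch below runs on simulated data; the data frame `toy` only mimics the layout of `sad`, and its values are assumptions.

```r
set.seed(1)
# Simulated long-format scores: one row per clip and method (stand-in for `sad`)
clips <- paste0("clip", 1:30)
toy <- data.frame(
  clip   = rep(clips, times = 2),
  method = rep(c("openSat_Sum", "openSat_noSum"), each = 30),
  B3F1   = c(runif(30, 0.5, 0.8), runif(30, 0.5, 0.8))
)

# Put the two methods side by side, keeping only clips scored by both
wide <- merge(
  toy[toy$method == "openSat_Sum",   c("clip", "B3F1")],
  toy[toy$method == "openSat_noSum", c("clip", "B3F1")],
  by = "clip", suffixes = c(".sum", ".nosum")
)

# Paired test: each clip serves as its own control
paired <- t.test(wide$B3F1.sum, wide$B3F1.nosum, paired = TRUE)
paired
```

The `merge` step also drops clips that only one of the two variants scored, which matters when the Ns differ.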
This dataset is not good for diarization because C1 and C2, F2 and F3, and M1 and M2 are interchangeable. Nonetheless, for what it’s worth, here are the results.
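# Read every old score file and stack the rows, tagging each with its file name
methods <- dir("old_res/", pattern = "txt")
oldres <- NULL
for (method in methods) {
  x <- read.table(paste0("old_res/", method), header = FALSE, skip = 1)
  oldres <- rbind(oldres, cbind(method, x))
}
names(oldres) <- c("method", "clip", "prec", "rec", "f1")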
summary(oldres)
## method clip
## score_ldc.txt :1084 aiku_20160714_12780.rttm: 6
## score_openSAT_1+2.txt :1018 aiku_20160714_16380.rttm: 6
## score_openSAT_12.txt : 979 aiku_20160714_1980.rttm : 6
## score_openSAT_1234.txt : 979 aiku_20160714_19980.rttm: 6
## score_openSAT_123478.txt :1083 aiku_20160714_27180.rttm: 6
## score_openSAT_1234789.txt:1083 aiku_20160714_30780.rttm: 6
## (Other) :6190
## prec rec f1
## Min. :0.5000 Min. :0.5000 Min. :0.500
## 1st Qu.:0.5800 1st Qu.:0.5800 1st Qu.:0.620
## Median :0.6700 Median :0.6800 Median :0.670
## Mean :0.7011 Mean :0.7021 Mean :0.689
## 3rd Qu.:0.8000 3rd Qu.:0.8100 3rd Qu.:0.740
## Max. :1.0000 Max. :1.0000 Max. :1.000
##
For LDC, there are 3 more clips in the old results than in the latest batch (1084 vs. 1081).
For openSat, there is 1 more clip in the old results than in the latest batch when considering 12 (979 vs. 978), but many more when considering 1+2 (1018) and 123478 (1083).
Why are the Ns different for 12, 1+2 and 123478?
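To see where the differing Ns come from, the clip lists of two score files can be compared directly. A sketch on a made-up stand-in for `oldres` (the clip names below are invented):

```r
# Toy stand-in for `oldres`; clip names are invented for illustration
toy_old <- data.frame(
  method = c(rep("score_openSAT_12.txt", 3), rep("score_openSAT_1+2.txt", 4)),
  clip   = c("a.rttm", "b.rttm", "c.rttm",
             "a.rttm", "b.rttm", "c.rttm", "d.rttm"),
  stringsAsFactors = FALSE
)

# Number of clips per score file
table(toy_old$method)

# Clips scored by 1+2 but absent from 12
clips_12  <- toy_old$clip[toy_old$method == "score_openSAT_12.txt"]
clips_1p2 <- toy_old$clip[toy_old$method == "score_openSAT_1+2.txt"]
only_in_1p2 <- setdiff(clips_1p2, clips_12)
only_in_1p2
```

Running `setdiff` in both directions shows whether one file is a strict superset of the other or whether each has clips the other lacks.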
t.test(oldres$f1, sad$B3F1[sad$method=="ldc"])
##
## Welch Two Sample t-test
##
## data: oldres$f1 and sad$B3F1[sad$method == "ldc"]
## t = 2.747, df = 1512.9, p-value = 0.006086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.002505028 0.015017034
## sample estimates:
## mean of x mean of y
## 0.6890186 0.6802576
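Strange, the old results look better than the latest LDC batch… Note, however, that oldres$f1 pools all six old score files rather than LDC alone, so this test does not isolate LDC. The old LDC mean (the intercept of the regression below, 0.678) is very close to the latest one (0.680).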
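Since `oldres` holds the scores of every file in `old_res/`, a like-for-like check would subset it to the LDC file before testing. A sketch on simulated data, where `toy_old` and `toy_new` are stand-ins for `oldres` and `sad` and all values are made up:

```r
set.seed(2)
# Simulated stand-ins for `oldres` (several score files) and `sad` (latest batch)
toy_old <- data.frame(
  method = rep(c("score_ldc.txt", "score_openSAT_12.txt"), each = 50),
  f1     = c(rnorm(50, 0.68, 0.09), rnorm(50, 0.72, 0.09))
)
toy_new <- data.frame(
  method = rep("ldc", 50),
  B3F1   = rnorm(50, 0.68, 0.09)
)

# Keep only the LDC rows on both sides before comparing
old_ldc <- toy_old$f1[toy_old$method == "score_ldc.txt"]
new_ldc <- toy_new$B3F1[toy_new$method == "ldc"]
ldc_test <- t.test(old_ldc, new_ldc)
ldc_test
```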
summary(lm(f1~method+(1/clip),data=oldres))
##
## Call:
## lm(formula = f1 ~ method + (1/clip), data = oldres)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20631 -0.06340 -0.02061 0.04369 0.34939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.678044 0.002876 235.797 < 2e-16 ***
## methodscore_openSAT_1+2.txt 0.038262 0.004132 9.260 < 2e-16 ***
## methodscore_openSAT_12.txt 0.045347 0.004174 10.863 < 2e-16 ***
## methodscore_openSAT_1234.txt 0.045357 0.004174 10.866 < 2e-16 ***
## methodscore_openSAT_123478.txt -0.027435 0.004068 -6.745 1.67e-11 ***
## methodscore_openSAT_1234789.txt -0.027435 0.004068 -6.745 1.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09468 on 6220 degrees of freedom
## Multiple R-squared: 0.1029, Adjusted R-squared: 0.1022
## F-statistic: 142.8 on 5 and 6220 DF, p-value: < 2.2e-16
The best is _12, which is indistinguishable from _1234; it is a little odd that 123478 and 1234789 give exactly the same numbers. Notice that here openSat outperforms LDC for 12, 1234 and 1+2, but not for 123478 and 1234789!
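The `(1/clip)` term does not give `lm` a per-clip effect (no clip terms appear in the coefficient table above). One base-R way to treat clip as the repeated-measures unit is an `Error()` stratum in `aov`; a mixed model would be another option. The sketch below uses simulated, balanced data, with `toy` standing in for `oldres` and all effect sizes chosen for illustration only.

```r
set.seed(3)
n_clip <- 20
method_names <- c("ldc", "openSAT_12", "openSAT_123478")

# Simulated balanced design: every clip scored by every method
toy <- expand.grid(clip = factor(paste0("clip", 1:n_clip)),
                   method = factor(method_names))
clip_effect  <- rnorm(n_clip, 0, 0.05)   # per-clip difficulty
method_shift <- c(0, 0.045, -0.027)      # illustrative method effects
toy$f1 <- 0.68 + clip_effect[as.integer(toy$clip)] +
  method_shift[as.integer(toy$method)] + rnorm(nrow(toy), 0, 0.03)

# method is tested within clips; between-clip variance gets its own stratum
fit <- aov(f1 ~ method + Error(clip), data = toy)
summary(fit)
```

With unbalanced data such as the real score files, where not every clip appears in every file, a mixed model is usually the more robust choice.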
Ideas for what might be harder:
read.table("../derivedFiles/line_per_segment_age.txt",header=T)->human
human$recstart=ifelse(human$recn==1,0,15*60*60)
human$clip=paste0(human$child,"_",substr(human$date,1,4),
substr(human$date,6,7),
substr(human$date,9,10),"_",
human$recstart + human$chunkstart+180
)
length(levels(factor(human$File)))
## [1] 1573
length(levels(factor(human$clip))) #the number we should end up with
## [1] 1439
aggregate(human$dur,by=list(human$clip,human$speakerID),sum)->sums
sums[sums$Group.2=="Noise","Group.1"]->withnoise
length(withnoise) #N of clips with noted noise
## [1] 345
aggregate(human$age,by=list(human$clip),mean)->age
names(age)<-c("clip","age")
dim(age) #N ok
## [1] 1439 2
aggregate(human$dur[human$type==0],by=list(human$clip[human$type==0]),sum)->durnonling
names(durnonling)<-c("clip","durnonling")
dim(durnonling)
## [1] 417 2
merge(age,durnonling,by="clip",all=T)->mytab
mytab$durnonling[is.na(mytab$durnonling)]<-0
data.frame(table(human$clip))->nsegs
names(nsegs)<-c("clip","nsegs")
merge(mytab,nsegs, all=T)->mytab
mytab$withnoise<-ifelse(mytab$clip %in% withnoise,1,0)
dim(mytab)
## [1] 1439 5
tocomp$clip=gsub(".rttm","",tocomp$clip,fixed=TRUE) #fixed=TRUE so that "." is matched literally
merge(tocomp,mytab,all=T)->x
dim(x)
## [1] 1490 13
lm(f1.ldc ~ age+durnonling+nsegs+withnoise,data=x)->mylm
plot(mylm)
zscore=function(x) (x-mean(x, na.rm=T))/sd(x,na.rm=T)
x$age.z=zscore(x$age)
x$nsegs.z=zscore(x$nsegs)
x$durnonling.z=zscore(x$durnonling)
lm(f1.ldc ~ age.z+durnonling.z+nsegs.z+withnoise,data=x)->mylm2
plot(mylm2)
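#NOTE: subset(a,b,c) does not AND the three conditions; the intended filter is presumably age.z<3 & durnonling.z<3 & nsegs.z<3
lm(f1.ldc ~ age.z+durnonling.z+nsegs.z+withnoise,data=x,subset(age.z<3,durnonling.z<3,nsegs.z<3))->mylm3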
plot(mylm3)
summary(mylm3)
##
## Call:
## lm(formula = f1.ldc ~ age.z + durnonling.z + nsegs.z + withnoise,
## data = x, subset = subset(age.z < 3, durnonling.z < 3, nsegs.z <
## 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17162 -0.06257 -0.01631 0.04874 0.31936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.679113 0.003817 177.938 <2e-16 ***
## age.z 0.001145 0.002972 0.385 0.7001
## durnonling.z -0.002899 0.002934 -0.988 0.3234
## nsegs.z 0.003959 0.004664 0.849 0.3961
## withnoise -0.016031 0.006697 -2.394 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09175 on 923 degrees of freedom
## (562 observations deleted due to missingness)
## Multiple R-squared: 0.007652, Adjusted R-squared: 0.003352
## F-statistic: 1.779 on 4 and 923 DF, p-value: 0.1308
Interpret carefully: it looks like a few points have too much influence. In any case, the regression overall is not significant (p = 0.13), and only a minute portion of the variance is explained (less than 1%). The only significant predictor is the presence of noted noise (withnoise), which is associated with slightly lower F1; the number of coded segments (i.e. the complexity of the conversation) is not significant.
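For reference, the `subset` argument of `lm` expects a single logical vector, so the three outlier cut-offs in the model above need to be combined with `&`. A sketch on simulated data, where `toy` and its values are made up and only the column names follow the model above:

```r
set.seed(4)
zscore <- function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)

# Simulated clip-level table standing in for `x`
toy <- data.frame(
  f1.ldc     = runif(200, 0.5, 0.9),
  age        = runif(200, 0, 11),
  durnonling = rexp(200, 1),
  nsegs      = rpois(200, 10),
  withnoise  = rbinom(200, 1, 0.25)
)
toy$age.z        <- zscore(toy$age)
toy$durnonling.z <- zscore(toy$durnonling)
toy$nsegs.z      <- zscore(toy$nsegs)

# One logical vector: a row is kept only if all three z-scores are below 3
keep <- with(toy, age.z < 3 & durnonling.z < 3 & nsegs.z < 3)
fit <- lm(f1.ldc ~ age.z + durnonling.z + nsegs.z + withnoise,
          data = toy, subset = keep)
nobs(fit) == sum(keep)
```

Checking `nobs(fit)` against `sum(keep)` is a quick way to confirm that the filter removed exactly the intended rows.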
plot(x$f1.ldc~x$nsegs,pch=20,xlim=c(0,55))
abline(lm(x$f1.ldc[x$nsegs<55]~x$nsegs[x$nsegs<55]),col="red")
boxplot(x$f1.ldc~x$withnoise,main="Performance as a function of whether there is background noise")